bold italics
R Markdown is a form of literate programming: a more elaborate way to write everything down, with notes, code, and output together in one document.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
source("functions.r")
Download a file and read it into R (the data folder must already exist):
download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv", destfile = "data/gapminder-FiveYearData.csv")
gapminder<- read.csv("data/gapminder-FiveYearData.csv")
head(gapminder)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
What is the life expectancy across those years?
p <- ggplot(data=gapminder,aes(x=year,y=lifeExp)) +
geom_point()
p
Let’s make it interactive:
ggplotly(p)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
If you are repeating yourself in your code, you may be able to solve that problem by making your own function!
newfunctionname <- function(argument1, argument2){ argument1 + argument2 }
The example here is the standard error: sd() gives the standard deviation, sqrt() the square root, and length() the sample size, because it counts how many values are in the sample.
On the right, the environment will show a new section with functions that displays se. Above, we loaded an R script with functions in it using source(), so we don't need to define them separately every time. This is kind of like loading a package, except that the functions show up in the environment. The roxygen package might be worth looking into for documenting functions.
se<- function(x){
sd(x)/sqrt(length(x))
}
Try it on data: (1) make a data set, (2) call se(dataset).
cars<- c(3,4,5,6,7,10)
se(cars)
## [1] 1.013794
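As a rough sketch of what sourcing a functions script does, the block below writes a small script to a temporary file and loads it with source(). The temporary file is only a stand-in for functions.r, whose real contents are not shown in these notes.

```r
# Write a helper function to a temporary script file
# (a hypothetical stand-in for functions.r).
functions_file <- tempfile(fileext = ".r")
writeLines(c(
  "# Standard error of the mean: sd divided by sqrt of the sample size",
  "se <- function(x) {",
  "  sd(x) / sqrt(length(x))",
  "}"
), functions_file)

# source() runs the script, so se() becomes available afterwards.
source(functions_file)

cars <- c(3, 4, 5, 6, 7, 10)
se_cars <- se(cars)
print(se_cars)
```

The printed value should agree with the 1.013794 shown above, since the same six numbers are used.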
dplyr
You will likely want to get subsections of your data frame and/or calculate means of a variable for a certain subsection. dplyr is your friend! (In the text, function names are put into single quotes.) Learn to select columns: from a data frame with columns a to d, select(data.frame, a, c) keeps only a and c.
You can also exclude columns: select(data.frame, -a, -c)
Look at the names with names() or row.names().
gapminder <- read.csv("data/gapminder-FiveYearData.csv")
year_country_gdp <- select(gapminder, year, country, gdpPercap)
year_country_gdp<- select(gapminder,-pop, -continent, -lifeExp)
names(year_country_gdp)
## [1] "country" "year" "gdpPercap"
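A minimal sketch of both forms of select() on a made-up data frame with columns a to d (the data frame df and its values are purely illustrative):

```r
library(dplyr)

# Hypothetical data frame with columns a through d
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

kept <- select(df, a, c)       # keep only columns a and c
dropped <- select(df, -a, -c)  # drop a and c, which leaves b and d

names(kept)
names(dropped)
```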
Next we want to filter. filter() is the same as select() but for rows, and it can use logical vectors as arguments. We can feed it with a pipe, %>%. The pipe says: take everything before it and use it as the first argument of the next function. That way you don't need to retype the data frame inside filter() (no gapminder$continent), and you build the operation up in layers.
year_country_gdp_euro <- select(gapminder, year, country, gdpPercap) %>% filter(continent=="Europe")
Use Ctrl+Shift+M as a shortcut for %>%.
The above won't work, because select() removed continent but filter() then looks for continent.
You can put a period in the first argument position, because the pipe already supplies the data frame (which is always the first argument, so we essentially leave it blank): select(., year, country, gdpPercap)
year_country_gdp_euro <- gapminder %>%
filter(continent=="Europe") %>%
select(.,year, country, gdpPercap)
The equivalent without pipes would be:
euro <- filter(gapminder, continent=="Europe")
year_country_gdp_euro <- select(euro, year, country, gdpPercap)
This needs an intermediate object, “euro”, and the order is very important. The reason to use pipes is to avoid writing out intermediate data frames over and over.
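To check that the piped and unpiped versions really agree, here is a small sketch on a made-up table. The data frame toy and its numbers are invented for illustration and are not gapminder values. It also shows the wrong order failing, wrapped in tryCatch() so the script keeps running.

```r
library(dplyr)

# Hypothetical toy data using the same column names as gapminder
toy <- data.frame(
  country = c("P", "Q", "R"),
  continent = c("Europe", "Africa", "Europe"),
  year = c(2007, 2007, 2007),
  gdpPercap = c(100, 200, 300),
  stringsAsFactors = FALSE
)

# With pipes: filter rows first, then select columns
piped <- toy %>%
  filter(continent == "Europe") %>%
  select(year, country, gdpPercap)

# Without pipes: an intermediate object is needed
euro <- filter(toy, continent == "Europe")
unpiped <- select(euro, year, country, gdpPercap)

identical(piped, unpiped)

# Selecting first drops continent, so the later filter raises an error
wrong_order <- tryCatch(
  toy %>% select(year, country, gdpPercap) %>% filter(continent == "Europe"),
  error = function(e) "error: continent is no longer a column"
)
wrong_order
```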
Exploring the amazing ‘group_by’ and ‘summarize’ functions
group_by allows you to take one big data frame, separate it into groups, and run operations on each group separately. Use group_by and summarize together.
summarize makes a new column: mean_gdp is the name of the new column, and mean(gdpPercap) is the value within this new column.
Add a second new column with the standard error of this GDP.
mean_gdp_percountry<- gapminder %>%
group_by(country) %>%
summarize(mean_gdp=mean(gdpPercap), se_gdp=se(gdpPercap))
mean_gdp_percountry
## # A tibble: 142 x 3
## country mean_gdp se_gdp
## <fctr> <dbl> <dbl>
## 1 Afghanistan 802.6746 31.23550
## 2 Albania 3255.3666 344.20223
## 3 Algeria 4426.0260 378.26190
## 4 Angola 3607.1005 336.56641
## 5 Argentina 8955.5538 537.68144
## 6 Australia 19980.5956 2256.11315
## 7 Austria 20411.9163 2787.23968
## 8 Bahrain 18077.6639 1563.29518
## 9 Bangladesh 817.5588 67.86165
## 10 Belgium 19900.7581 2422.32683
## # ... with 132 more rows
task: get mean, se, and sample size for lifeExp by continent
You can get the sample size two ways: length(continent) or n() (a function built in to ‘dplyr’; the parentheses can be left empty).
mean_life_percontinent<- gapminder %>%
group_by(continent) %>%
summarize(mean_life_expectancy=mean(lifeExp), se_life_expectancy=se(lifeExp), sample_size=n())
mean_life_percontinent
## # A tibble: 5 x 4
## continent mean_life_expectancy se_life_expectancy sample_size
## <fctr> <dbl> <dbl> <int>
## 1 Africa 48.86533 0.3663016 624
## 2 Americas 64.65874 0.5395389 300
## 3 Asia 60.06490 0.5962151 396
## 4 Europe 71.90369 0.2863536 360
## 5 Oceania 74.32621 0.7747759 24
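A small sketch, on invented numbers, showing that length() and n() give the same per-group sample size inside summarize(). The data frame toy is hypothetical, and se() is redefined here only so the block runs on its own.

```r
library(dplyr)

# Same standard error helper as above, repeated so this block is self-contained
se <- function(x) sd(x) / sqrt(length(x))

# Hypothetical data: group "a" has 3 values, group "b" has 2
toy <- data.frame(
  group = c("a", "a", "a", "b", "b"),
  value = c(1, 2, 3, 10, 20)
)

toy_summary <- toy %>%
  group_by(group) %>%
  summarize(
    mean_value = mean(value),
    se_value = se(value),
    n_by_length = length(value),  # counts the values in each group
    n_by_n = n()                  # dplyr's built-in group size
  )
toy_summary
```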
You can group by multiple variables; here, country is added to continent.
mean_life_percontinent<- gapminder %>%
group_by(continent,country) %>%
summarize(mean_life_expectancy=mean(lifeExp), se_life_expectancy=se(lifeExp), sample_size=n())
mean_life_percontinent
## # A tibble: 142 x 5
## # Groups: continent [?]
## continent country mean_life_expectancy
## <fctr> <fctr> <dbl>
## 1 Africa Algeria 59.03017
## 2 Africa Angola 37.88350
## 3 Africa Benin 48.77992
## 4 Africa Botswana 54.59750
## 5 Africa Burkina Faso 44.69400
## 6 Africa Burundi 44.81733
## 7 Africa Cameroon 48.12850
## 8 Africa Central African Republic 43.86692
## 9 Africa Chad 46.77358
## 10 Africa Comoros 52.38175
## # ... with 132 more rows, and 2 more variables: se_life_expectancy <dbl>,
## # sample_size <int>
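The "Groups: continent" line in the output above means the result is still grouped by continent after summarize(). The sketch below, on a made-up table, inspects that with group_vars() and removes it with ungroup(). The data are invented, and newer dplyr versions may also print a message about the `.groups` argument.

```r
library(dplyr)

# Hypothetical data: two continents, three countries
toy <- data.frame(
  continent = c("X", "X", "X", "Y"),
  country = c("x1", "x1", "x2", "y1"),
  lifeExp = c(50, 60, 70, 80)
)

by_both <- toy %>%
  group_by(continent, country) %>%
  summarize(mean_life_expectancy = mean(lifeExp), sample_size = n())

group_vars(by_both)       # the grouping that remains after summarize()
flat <- ungroup(by_both)  # drop the grouping before whole-table operations
group_vars(flat)
```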
Combine dplyr with ggplot (select is for columns, filter is for rows). The pipe moves the data into ggplot: whatever came before the pipe is passed along to the next step.
euro_countries <- gapminder %>%
filter(continent== "Europe") %>%
ggplot(aes(x=year,y=lifeExp, color=country)) +geom_line() + facet_wrap(~country)
euro_countries
tidyr
R likes to have ‘long’ format data, where every row is an observation and you have a single column for ‘observations’; the other columns serve to identify that observation. (Exceptions apply when you have multiple types of observations.) To switch back and forth between ‘wide’ (how we typically enter data in a spreadsheet) and ‘long’, use tidyr.
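A minimal sketch of the wide-to-long round trip, assuming tidyr 1.0.0 or later, where pivot_longer() and pivot_wider() are available (older versions use gather() and spread() instead). The table wide and its values are invented for illustration.

```r
library(tidyr)

# Hypothetical wide table: one row per country, one column per year
wide <- data.frame(
  country = c("A", "B"),
  lifeExp_1952 = c(30, 40),
  lifeExp_1957 = c(32, 43),
  stringsAsFactors = FALSE
)

# Wide to long: one row per country-year observation
long <- pivot_longer(
  wide,
  cols = starts_with("lifeExp_"),
  names_to = "year",
  names_prefix = "lifeExp_",  # strip the prefix so year holds "1952", "1957"
  values_to = "lifeExp"
)
long

# Long back to wide, restoring the original column names
wide_again <- pivot_wider(
  long,
  names_from = year,
  values_from = lifeExp,
  names_prefix = "lifeExp_"
)
wide_again
```

In the long table, country and year identify each observation and lifeExp holds the observed value.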